Take Home Exercise 03: Prototyping Modules for Geospatial Analytics Shiny Application

Author

Pan Mingwei

Published

October 18, 2024

Modified

October 26, 2024

1. Overview

In this take-home exercise, I will focus on prototyping Geographically Weighted Regression (GWR) models for my group’s Shiny App. GWR is a spatial statistical method that accounts for non-stationary variables (such as climate, demographics, and physical environment characteristics) and models the local relationships between these independent variables and the outcome of interest. In this case, the dependent variable is the rental price of HDB flats in Singapore, and I will examine how factors such as flat size, proximity to MRT and CBD, remaining lease, storey height, and more influence HDB rental prices.The data preparation and Exploratory Data Analysis were handled by my groupmate, so for this exercise, I will load the data directly from an RDS file.

2 The R-Packages

Some important packages, we going to use in the exercise:

  • olsrr: performing disgnostics test
  • tidyverse: attribute data handling
  • GWmodel: build GWR model
  • sf: spatial data handling
  • tmap: choropleth mapping
  • corrplot: visual exploratory tool on correlation matrix
pacman::p_load(olsrr, corrplot, sf, spdep, GWmodel, tmap, tidyverse, gtsummary)

3. The Data

3.1 Aspatial Data

First, import the rental dataset, as the data wrangling was done by teammate. Please refer to here for details.

  • rental.sf => contains the rental data from Jan 2020 to Sept 2024, as well as other fields like:

    • Dependent:

      • Monthly Rental feemonthly_rent
    • Continuous:

      • Proximity measure: kindergarten, childcare, hawker, bus stops, shopping mall, mrt, primary schools, cbd

      • Count of amenities within specific distance: kindergarten, childcare, hawker, bus stops, shopping mall,

    • Categorical:

      • Flat Typeflat_type

      • Towntown

      • Regionregion

rental.sf <- read_rds("data/rds/rental_sf.rds")
head(rental.sf)
Simple feature collection with 6 features and 18 fields
Geometry type: POINT
Dimension:     XY
Bounding box:  xmin: 15786.61 ymin: 30769.77 xmax: 39668.39 ymax: 45634.94
Projected CRS: SVY21 / Singapore TM
# A tibble: 6 × 19
  rent_approval_date town       flat_type monthly_rent            geometry
  <date>             <chr>      <chr>            <dbl>         <POINT [m]>
1 2024-01-01         YISHUN     4-ROOM            3700 (29463.95 45634.94)
2 2024-01-01         JURONG WE… 4-ROOM            3900 (15786.61 34389.18)
3 2024-01-01         SENGKANG   5-ROOM            3700  (33244.8 42043.31)
4 2024-01-01         KALLANG/W… 3-ROOM            3600 (30102.41 32911.75)
5 2024-01-01         BEDOK      5-ROOM            4350 (39668.39 33918.01)
6 2024-01-01         QUEENSTOWN 4-ROOM            3000 (25265.85 30769.77)
# ℹ 14 more variables: region <chr>, no_of_kindergarten_500m <dbl>,
#   prox_kindergarten <dbl>, no_of_childcare_500m <dbl>, prox_childcare <dbl>,
#   no_of_hawker_500m <dbl>, prox_hawker <dbl>, no_of_busstop_500m <dbl>,
#   prox_busstop <dbl>, no_of_shoppingmall_1km <dbl>, prox_shoppingmall <dbl>,
#   prox_mrt <dbl>, prox_prisch <dbl>, prox_cbd <dbl>
st_crs(rental.sf)
Coordinate Reference System:
  User input: EPSG:3414 
  wkt:
PROJCRS["SVY21 / Singapore TM",
    BASEGEOGCRS["SVY21",
        DATUM["SVY21",
            ELLIPSOID["WGS 84",6378137,298.257223563,
                LENGTHUNIT["metre",1]]],
        PRIMEM["Greenwich",0,
            ANGLEUNIT["degree",0.0174532925199433]],
        ID["EPSG",4757]],
    CONVERSION["Singapore Transverse Mercator",
        METHOD["Transverse Mercator",
            ID["EPSG",9807]],
        PARAMETER["Latitude of natural origin",1.36666666666667,
            ANGLEUNIT["degree",0.0174532925199433],
            ID["EPSG",8801]],
        PARAMETER["Longitude of natural origin",103.833333333333,
            ANGLEUNIT["degree",0.0174532925199433],
            ID["EPSG",8802]],
        PARAMETER["Scale factor at natural origin",1,
            SCALEUNIT["unity",1],
            ID["EPSG",8805]],
        PARAMETER["False easting",28001.642,
            LENGTHUNIT["metre",1],
            ID["EPSG",8806]],
        PARAMETER["False northing",38744.572,
            LENGTHUNIT["metre",1],
            ID["EPSG",8807]]],
    CS[Cartesian,2],
        AXIS["northing (N)",north,
            ORDER[1],
            LENGTHUNIT["metre",1]],
        AXIS["easting (E)",east,
            ORDER[2],
            LENGTHUNIT["metre",1]],
    USAGE[
        SCOPE["Cadastre, engineering survey, topographic mapping."],
        AREA["Singapore - onshore and offshore."],
        BBOX[1.13,103.59,1.47,104.07]],
    ID["EPSG",3414]]

Notice that our rental.sf was in EPSG 3414.

3.2 Geospatial Data

Using st_read() of sf package to import the MP14_SUBZONE_WEB_PL shapefile.

mpsz = st_read(dsn = "data/geospatial", layer = "MP14_SUBZONE_WEB_PL")
Reading layer `MP14_SUBZONE_WEB_PL' from data source 
  `/Users/mingwei/Desktop/SMU/Y3S1/IS415/xXxPMWxXx/IS415-GAA/Take-home_Ex/Take-home_Ex03/data/geospatial' 
  using driver `ESRI Shapefile'
Simple feature collection with 323 features and 15 fields
Geometry type: MULTIPOLYGON
Dimension:     XY
Bounding box:  xmin: 2667.538 ymin: 15748.72 xmax: 56396.44 ymax: 50256.33
Projected CRS: SVY21

As the aspatial data we going to use was in EPSG=3414, the code chunk below will transform mpsz object to ESPG code = 3414 using st_transform() method of sf package.

mpsz_svy21 <- st_transform(mpsz, 3414)
st_crs(mpsz_svy21)
Coordinate Reference System:
  User input: EPSG:3414 
  wkt:
PROJCRS["SVY21 / Singapore TM",
    BASEGEOGCRS["SVY21",
        DATUM["SVY21",
            ELLIPSOID["WGS 84",6378137,298.257223563,
                LENGTHUNIT["metre",1]]],
        PRIMEM["Greenwich",0,
            ANGLEUNIT["degree",0.0174532925199433]],
        ID["EPSG",4757]],
    CONVERSION["Singapore Transverse Mercator",
        METHOD["Transverse Mercator",
            ID["EPSG",9807]],
        PARAMETER["Latitude of natural origin",1.36666666666667,
            ANGLEUNIT["degree",0.0174532925199433],
            ID["EPSG",8801]],
        PARAMETER["Longitude of natural origin",103.833333333333,
            ANGLEUNIT["degree",0.0174532925199433],
            ID["EPSG",8802]],
        PARAMETER["Scale factor at natural origin",1,
            SCALEUNIT["unity",1],
            ID["EPSG",8805]],
        PARAMETER["False easting",28001.642,
            LENGTHUNIT["metre",1],
            ID["EPSG",8806]],
        PARAMETER["False northing",38744.572,
            LENGTHUNIT["metre",1],
            ID["EPSG",8807]]],
    CS[Cartesian,2],
        AXIS["northing (N)",north,
            ORDER[1],
            LENGTHUNIT["metre",1]],
        AXIS["easting (E)",east,
            ORDER[2],
            LENGTHUNIT["metre",1]],
    USAGE[
        SCOPE["Cadastre, engineering survey, topographic mapping."],
        AREA["Singapore - onshore and offshore."],
        BBOX[1.13,103.59,1.47,104.07]],
    ID["EPSG",3414]]

Notice that the EPSG is now in 3414, same as rental.sf. We are good to go.

4. Hedonic Pricing Modelling

4.1 Simple Linear Regression Method

To start off, lest build a simple linear regression model by using monthly_rent as the dependent variable. Since the dataset, only have flat type, for exploring, I will use prox_mrt as the independent variable.

Note

I will try out simple linear regression, but I won’t dive deeply into it, as this method will not be part of our Shiny App. Instead, I will focus more on multiple linear regression method and building Hedonic Pricing Models using the GWmodel package in the next section. Which I will explore the different arguments available so that we can include them in our Shiny App.

hdb.slr <- lm(formula=monthly_rent ~ prox_mrt, data = rental.sf)

lm() returns an object of class “lm” or for multiple responses of class c(“mlm”, “lm”).

The functions summary() and anova() can be used to obtain and print a summary and analysis of variance table of the results. The generic accessor functions coefficients, effects, fitted.values and residuals extract various useful features of the value returned by lm.

summary(hdb.slr)

Call:
lm(formula = resale_price ~ floor_area_sqft, data = resale.sf)

Residuals:
    Min      1Q  Median      3Q     Max 
-469929  -86856  -34413   44193  893943 

Coefficients:
                Estimate Std. Error t value Pr(>|t|)    
(Intercept)     103111.3     3811.1   27.05   <2e-16 ***
floor_area_sqft    490.2        3.6  136.16   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 137200 on 21560 degrees of freedom
Multiple R-squared:  0.4623,    Adjusted R-squared:  0.4623 
F-statistic: 1.854e+04 on 1 and 21560 DF,  p-value: < 2.2e-16

The output report reveals that the monthly_rent can be explained by using the formula:

monthly_rent = 3200.84161 - 0.20003(prox_mrt)

The coefficients section, the p-value for the hypothesis test that the coefficient is equal to zero. Since both values are less than 0.001, both the intercept and the floor area are statistically significant 

With the multiple R-squared of 0.01045, Indicates that about 1.04% of the variability in rental prices is explained by the model. This suggests that other factors likely influence rental prices, as over 99% of the variability remains unexplained.

Since p-value is much smaller than 0.0001, we will reject the null hypothesis that mean is a good estimator of monthly_rent. This will allow us to infer that simple linear regression model above is a good estimator of monthly_rent.

To visualise the best fit curve on a scatterplot, we can incorporate lm() as a method function in ggplot’s geometry as shown in the code chunk below.

Click to expand/collapse code
ggplot(data=rental.sf,  
       aes(x=`prox_mrt`, y=`monthly_rent`)) +
  geom_point() +
  geom_smooth(method = lm)

We can see from above that, using prox_mrt variable is not really accurate for simple linear regression model.

4.2 Multiple Linear Regression

4.2.1 Visualising the Relationships of the Independent Variables

Before constructing a multiple regression model, it’s crucial to verify that the independent variables are not highly correlated with one another. Using highly correlated independent variables by mistake can undermine the model’s quality. This issue is referred to as multicollinearity in statistics.

A correlation matrix is often utilized to visualize the relationships among independent variables. In addition to R’s pairs() function, there are several packages available that facilitate the display of a correlation matrix. In this section, we will use the corrplot package.

First, lets check the column for the rental.sf.

names(rental.sf)
 [1] "rent_approval_date"      "town"                   
 [3] "flat_type"               "monthly_rent"           
 [5] "geometry"                "region"                 
 [7] "no_of_kindergarten_500m" "prox_kindergarten"      
 [9] "no_of_childcare_500m"    "prox_childcare"         
[11] "no_of_hawker_500m"       "prox_hawker"            
[13] "no_of_busstop_500m"      "prox_busstop"           
[15] "no_of_shoppingmall_1km"  "prox_shoppingmall"      
[17] "prox_mrt"                "prox_prisch"            
[19] "prox_cbd"               

First, lets pick all the numeric independent variables.

independent_columns <- rental.sf %>% 
  select(7:19) %>%
  st_drop_geometry()

The code chunk below is used to plot a scatterplot matrix of the relationship between the independent variables in rental.sf data.frame.

corrplot(cor(independent_columns), diag = FALSE, order = "AOE", 
         tl.pos = "td", tl.cex = 0.5, method = "number", type = "upper")

corrplot(cor(independent_columns), diag = FALSE, order = "FPC", 
         tl.pos = "td", tl.cex = 0.5, method = "number", type = "upper")

corrplot(cor(independent_columns), diag = FALSE, order = "AOE", 
         tl.pos = "ld", tl.cex = 0.5, method = "number", type = "lower")

corrplot(cor(independent_columns), diag = FALSE, order = "AOE", 
         tl.pos = "full", tl.cex = 0.5, method = "number", type = "full")

corrplot(cor(independent_columns), diag = FALSE, order = "AOE", 
         tl.pos = "td", tl.cex = 0.5, method = "circle", type = "upper")

After reviewing the documentation for the corrplot package and examining the plot above, I believe it would be beneficial to create a page where users can interact with various arguments to explore the correlation matrix of the independent variables.

There are different input/option for the arguments such as:

  • method

    • circle (default)

    • square

    • ellipse

    • number

    • pie

    • shade

    • color

  • type

    • full (default) => tl.pos need to set as “full”, cannot set as “td” as type = “upper”

    • upper => tl.pos = “td”

    • lower => tl.pos = “ld”

  • order

    • original (default) => orginal order

    • AOE => angular order of the eigenvectors

    • FPC => first principal component order

    • hclust => for the hierarchical clustering order

      • hclust.method: when the order is hclust, the below method can be define

        • ‘ward’ , ward.D’, ‘ward.D2’, ‘single’, ‘complete’, ‘average’, ‘mcquitty’, ‘median’ or ‘centroid’
    • alphabet => alphabetical order

4.2.2 Building the Hedonic Pricing Model using Multiple Linear Regression Method

names(independent_columns)
 [1] "floor_area_sqft"               "distance_to_mrt_meters"       
 [3] "distance_to_cbd"               "distance_to_pri_school_meters"
 [5] "housing_type"                  "remaining_lease_total_months" 
 [7] "remaining_lease_years"         "floor_7_to_9"                 
 [9] "floor_4_to_6"                  "floor_1_to_3"                 
[11] "floor_16_and_above"            "floor_10_to_12"               
[13] "floor_13_to_15"                "no_of_kindergarten_500m"      
[15] "prox_kindergarten"             "no_of_childcare_500m"         
[17] "prox_childcare"                "no_of_hawker_500m"            
[19] "prox_hawker"                   "no_of_busstop_500m"           
[21] "prox_busstop"                  "no_of_shoppingmall_1km"       
[23] "prox_shoppingmall"            
Click to expand/collapse code
rental_mlr <- lm(formula = monthly_rent ~ no_of_kindergarten_500m + prox_kindergarten + 
                   no_of_childcare_500m + prox_childcare + no_of_hawker_500m 
                 + prox_hawker + no_of_busstop_500m + prox_busstop
                 + no_of_shoppingmall_1km + prox_shoppingmall + prox_mrt
                 + prox_prisch + prox_cbd,
                 data = rental.sf)

summary(rental_mlr)

Call:
lm(formula = monthly_rent ~ no_of_kindergarten_500m + prox_kindergarten + 
    no_of_childcare_500m + prox_childcare + no_of_hawker_500m + 
    prox_hawker + no_of_busstop_500m + prox_busstop + no_of_shoppingmall_1km + 
    prox_shoppingmall + prox_mrt + prox_prisch + prox_cbd, data = rental.sf)

Residuals:
    Min      1Q  Median      3Q     Max 
-2621.8  -377.5    15.4   415.4  3361.9 

Coefficients:
                          Estimate Std. Error t value Pr(>|t|)    
(Intercept)              3.432e+03  3.317e+01 103.469  < 2e-16 ***
no_of_kindergarten_500m  7.354e+00  3.579e+00   2.055   0.0399 *  
prox_kindergarten       -4.321e-02  2.791e-02  -1.548   0.1216    
no_of_childcare_500m     4.975e-01  1.449e+00   0.343   0.7314    
prox_childcare          -4.983e-02  4.756e-02  -1.048   0.2948    
no_of_hawker_500m       -4.873e+01  6.515e+00  -7.479 7.72e-14 ***
prox_hawker              4.780e-02  1.078e-02   4.432 9.36e-06 ***
no_of_busstop_500m       7.486e+00  1.057e+00   7.085 1.42e-12 ***
prox_busstop             1.836e-01  7.566e-02   2.426   0.0153 *  
no_of_shoppingmall_1km  -6.699e+00  3.549e+00  -1.887   0.0591 .  
prox_shoppingmall       -1.216e-01  1.415e-02  -8.598  < 2e-16 ***
prox_mrt                -1.248e-01  1.327e-02  -9.411  < 2e-16 ***
prox_prisch              8.372e-04  1.588e-02   0.053   0.9580    
prox_cbd                -2.798e-02  1.118e-03 -25.034  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 646.5 on 25699 degrees of freedom
Multiple R-squared:  0.03757,   Adjusted R-squared:  0.03708 
F-statistic: 77.17 on 13 and 25699 DF,  p-value: < 2.2e-16

Based on the coefficients section, we can see that not all the independent variables are statistically significant, some variable we can remove from our model. Based on the Pr (p-value) field, we can remove prox_kindergarten, no_of_childcare_500m, prox_childcare, no_of_shoppingmall_1km and prox_prisch field.

Notice the adjused R-squared value is only 0.03708.

4.2.3 Preparing Publication Quality Table: olsrr method

Lets. update the model by removing the 4 variables which are not statistically significant and display the Publication Quality Table using ols_regress() of olsrr package

Click to expand/collapse code
rental_mlr <- lm(formula = monthly_rent ~ no_of_kindergarten_500m + 
                 + no_of_hawker_500m 
                 + prox_hawker + no_of_busstop_500m + prox_busstop
                 + prox_shoppingmall + prox_mrt
                 + prox_cbd,
                 data = rental.sf)
ols_regress(rental_mlr)
                           Model Summary                             
--------------------------------------------------------------------
R                         0.193       RMSE                  646.509 
R-Squared                 0.037       MSE                418120.408 
Adj. R-Squared            0.037       Coef. Var              20.847 
Pred R-Squared            0.036       AIC                405798.183 
MAE                     498.559       SBC                405879.730 
--------------------------------------------------------------------
 RMSE: Root Mean Square Error 
 MSE: Mean Square Error 
 MAE: Mean Absolute Error 
 AIC: Akaike Information Criteria 
 SBC: Schwarz Bayesian Criteria 

                                     ANOVA                                       
--------------------------------------------------------------------------------
                       Sum of                                                   
                      Squares           DF     Mean Square       F         Sig. 
--------------------------------------------------------------------------------
Regression      414827561.762            8    51853445.220    124.016    0.0000 
Residual      10747366972.513        25704      418120.408                      
Total         11162194534.275        25712                                      
--------------------------------------------------------------------------------

                                            Parameter Estimates                                             
-----------------------------------------------------------------------------------------------------------
                  model        Beta    Std. Error    Std. Beta       t        Sig        lower       upper 
-----------------------------------------------------------------------------------------------------------
            (Intercept)    3386.962        27.314                 124.001    0.000    3333.425    3440.499 
no_of_kindergarten_500m      10.715         2.841        0.024      3.771    0.000       5.146      16.284 
      no_of_hawker_500m     -48.313         6.504       -0.062     -7.428    0.000     -61.061     -35.565 
            prox_hawker       0.050         0.011        0.038      4.646    0.000       0.029       0.071 
     no_of_busstop_500m       7.354         1.029        0.050      7.146    0.000       5.337       9.371 
           prox_busstop       0.181         0.076        0.015      2.391    0.017       0.033       0.329 
      prox_shoppingmall      -0.109         0.011       -0.065     -9.595    0.000      -0.131      -0.087 
               prox_mrt      -0.125         0.013       -0.064     -9.831    0.000      -0.150      -0.100 
               prox_cbd      -0.028         0.001       -0.192    -25.274    0.000      -0.030      -0.026 
-----------------------------------------------------------------------------------------------------------

4.2.4 Testing

To ensure that our multiple linear regression model are good to go. We need to perform the following test:

  • Test of Multicollinearity

  • Test of Non-Linearity

  • Test for Normality Assumption

  • Test for Spatial Autocorrelation

4.2.4.1 Test of Multicollinearity

The code chunk below, the ols_vif_tol() of olsrr package is used to test if there are sign of multicollinearity.

To check if the independent variables in the model are highly correlated with each other. They should not be highly correlated, as it will leads to less reliable coefficient estimates.

ols_vif_tol(rental_mlr)
                Variables Tolerance      VIF
1 no_of_kindergarten_500m 0.9497317 1.052929
2       no_of_hawker_500m 0.5330345 1.876051
3             prox_hawker 0.5725711 1.746508
4      no_of_busstop_500m 0.7554615 1.323694
5            prox_busstop 0.9180374 1.089280
6       prox_shoppingmall 0.8053481 1.241699
7                prox_mrt 0.8860694 1.128580
8                prox_cbd 0.6468219 1.546021

Since the VIF of the independent variables are less than 10. We can safely conclude that there are no sign of multicollinearity among the independent variables.

4.2.4.2 Test for Non-Learity

To check the relationship between the dependent variable (e.g., rental price) and the independent variables (e.g., floor area, distance to MRT) is linear. If the relationship is non-linear, the model will not accurately capture the patterns in the data.

The code chunk below, the ols_plot_resid_fit() of olsrr package is used to perform linearity assumption test.

ols_plot_resid_fit(rental_mlr)

The figure above reveals that most of the data poitns are scattered around the 0 line, hence we can safely conclude that the relationships between the dependent variable and independent variables are linear.

4.2.4.3 Test for Normality Assumption

The residuals (the differences between observed and predicted values) in linear regression are assumed to be normally distributed. If the residuals deviate significantly from normality, it may indicate problems such as outliers or influential points.

The code chunk below uses ols_plot_resid_hist() of olsrr package to perform normality assumption test.

ols_plot_resid_hist(rental_mlr)

As we can see, the residual of the multiple linear regression model resembles a normal distribution.

4.2.4.4 Test for Spatial Autocorrelation

In spatial data, nearby observations may not be independent of each other, violating the assumption of independent observations. If residuals are spatially autocorrelated, it suggests that the model is missing key spatial relationships, leading to biased or inefficient estimates. 

mlr.output <- as.data.frame(rental_mlr$residuals)
rental.res.sf <- cbind(rental.sf, 
                        rental_mlr$residuals) %>%
rename(`MLR_RES` = `rental_mlr.residuals`)

let’s display the distribution of the residuals on an interactive map:

tmap_mode("view")
tm_shape(mpsz_svy21)+
  tmap_options(check.and.fix = TRUE) +
  tm_polygons(alpha = 0.4) +
tm_shape(rental.res.sf) +  
  tm_dots(col = "MLR_RES",
          alpha = 0.6,
          style="quantile") +
  tm_view(set.zoom.limits = c(11,14))
tmap_mode("plot")

5. Building Hedonic Pricing Models using GWmodel

5.1 Build Adaptive Bandwidth GWR Model

The code chunk below is using bw.gwr() of GWModel package is used to determine the optimal fixed bandwidth to use in the model. Notice that the argument adaptive is set to TRUE indicates that we are interested to compute the adaptive bandwidth.

Since calculating the bandwidth appears to take a long time, it may not be ideal to allow users to select the option and compute it in real-time in our Shiny App. Therefore, I plan to pre-compute the bandwidth values and store them in the backend. This way, when users select the bandwidth type (fixed or adaptive), approach, and kernel, they won’t have to wait long for the results.

Click to expand/collapse code
bw.adaptive <- bw.gwr(formula = monthly_rent ~ no_of_kindergarten_500m + 
                 + no_of_hawker_500m 
                 + prox_hawker + no_of_busstop_500m + prox_busstop
                 + prox_shoppingmall + prox_mrt
                 + prox_cbd,
                   data=rental.res.sf, 
                   approach="CV", 
                   kernel="gaussian", 
                   adaptive=TRUE, 
                   longlat=FALSE)
Take a cup of tea and have a break, it will take a few minutes.
          -----A kind suggestion from GWmodel development group
Adaptive bandwidth: 15899 CV score: 10651256638 
Adaptive bandwidth: 9834 CV score: 10562030485 
Adaptive bandwidth: 6085 CV score: 10422218998 
Adaptive bandwidth: 3768 CV score: 10243862737 
Adaptive bandwidth: 2336 CV score: 10006730855 
Adaptive bandwidth: 1451 CV score: 9809341570 
Adaptive bandwidth: 904 CV score: 9629318132 
Adaptive bandwidth: 566 CV score: 9469206997 
Adaptive bandwidth: 357 CV score: 9237980862 
Adaptive bandwidth: 227 CV score: 8930582153 
Adaptive bandwidth: 148 CV score: 8749757792 
Adaptive bandwidth: 97 CV score: 8579941910 
Adaptive bandwidth: 68 CV score: 8464951724 
Adaptive bandwidth: 47 CV score: 1.550042e+33 
Adaptive bandwidth: 77 CV score: 8498886660 
Adaptive bandwidth: 58 CV score: 8415559067 
Adaptive bandwidth: 56 CV score: 8404607970 
Adaptive bandwidth: 50 CV score: 1.550042e+33 
Adaptive bandwidth: 54 CV score: 8395864347 
Adaptive bandwidth: 59 CV score: 8419866277 
Adaptive bandwidth: 57 CV score: 8410563166 
Adaptive bandwidth: 58 CV score: 8415559067 
Adaptive bandwidth: 57 CV score: 8410563166 
Adaptive bandwidth: 57 CV score: 8410563166 
Adaptive bandwidth: 56 CV score: 8404607970 
Adaptive bandwidth: 56 CV score: 8404607970 
Adaptive bandwidth: 55 CV score: 8399958858 
Adaptive bandwidth: 55 CV score: 8399958858 
Adaptive bandwidth: 54 CV score: 8395864347 

The result shows that number of recommended data point, which we will use for generating the adaptive bandwidth GWR model.

5.2 Constructing the Adaptive Bandwidth GWR Model

Notice that the below code chunk, for gwr.basic() method, it does not have the arguement like approach.

Click to expand/collapse code
gwr.adaptive <- gwr.basic(formula = monthly_rent ~ no_of_kindergarten_500m
                 + no_of_hawker_500m 
                 + prox_hawker + no_of_busstop_500m + prox_busstop
                 + prox_shoppingmall + prox_mrt
                 + prox_cbd,
                   data=rental.res.sf, bw=54,
                   kernel="gaussian", 
                   adaptive=TRUE, 
                   longlat=FALSE)
gwr.adaptive
   ***********************************************************************
   *                       Package   GWmodel                             *
   ***********************************************************************
   Program starts at: 2024-10-26 11:40:33.151911 
   Call:
   gwr.basic(formula = monthly_rent ~ no_of_kindergarten_500m + 
    no_of_hawker_500m + prox_hawker + no_of_busstop_500m + prox_busstop + 
    prox_shoppingmall + prox_mrt + prox_cbd, data = rental.res.sf, 
    bw = 54, kernel = "gaussian", adaptive = TRUE, longlat = FALSE)

   Dependent (y) variable:  monthly_rent
   Independent variables:  no_of_kindergarten_500m no_of_hawker_500m prox_hawker no_of_busstop_500m prox_busstop prox_shoppingmall prox_mrt prox_cbd
   Number of data points: 25713
   ***********************************************************************
   *                    Results of Global Regression                     *
   ***********************************************************************

   Call:
    lm(formula = formula, data = data)

   Residuals:
    Min      1Q  Median      3Q     Max 
-2655.7  -377.5    14.6   415.2  3374.2 

   Coefficients:
                             Estimate Std. Error t value Pr(>|t|)    
   (Intercept)              3.387e+03  2.731e+01 124.001  < 2e-16 ***
   no_of_kindergarten_500m  1.072e+01  2.841e+00   3.771 0.000163 ***
   no_of_hawker_500m       -4.831e+01  6.504e+00  -7.428 1.13e-13 ***
   prox_hawker              4.983e-02  1.072e-02   4.646 3.39e-06 ***
   no_of_busstop_500m       7.354e+00  1.029e+00   7.146 9.17e-13 ***
   prox_busstop             1.809e-01  7.562e-02   2.391 0.016790 *  
   prox_shoppingmall       -1.087e-01  1.133e-02  -9.595  < 2e-16 ***
   prox_mrt                -1.251e-01  1.272e-02  -9.831  < 2e-16 ***
   prox_cbd                -2.765e-02  1.094e-03 -25.274  < 2e-16 ***

   ---Significance stars
   Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 
   Residual standard error: 646.6 on 25704 degrees of freedom
   Multiple R-squared: 0.03716
   Adjusted R-squared: 0.03686 
   F-statistic:   124 on 8 and 25704 DF,  p-value: < 2.2e-16 
   ***Extra Diagnostic information
   Residual sum of squares: 10747366973
   Sigma(hat): 646.5343
   AIC:  405798.2
   AICc:  405798.2
   BIC:  380268.3
   ***********************************************************************
   *          Results of Geographically Weighted Regression              *
   ***********************************************************************

   *********************Model calibration information*********************
   Kernel function: gaussian 
   Adaptive bandwidth: 54 (number of nearest neighbours)
   Regression points: the same locations as observations are used.
   Distance metric: Euclidean distance metric is used.

   ****************Summary of GWR coefficient estimates:******************
                                  Min.     1st Qu.      Median     3rd Qu.
   Intercept               -3.0066e+05 -1.3166e+03  3.1654e+03  8.1646e+03
   no_of_kindergarten_500m -2.5669e+03 -8.9269e+01 -1.2131e+01  6.1644e+01
   no_of_hawker_500m       -6.8129e+03 -1.7351e+02 -1.9081e+01  1.5754e+02
   prox_hawker             -7.3690e+01 -5.1075e-01  1.0307e-01  7.8401e-01
   no_of_busstop_500m      -2.9869e+02 -2.0997e+01 -6.1941e-01  1.9823e+01
   prox_busstop            -1.5168e+01 -8.6745e-01  3.5281e-02  1.0984e+00
   prox_shoppingmall       -3.1216e+01 -6.9688e-01 -8.7709e-02  5.3366e-01
   prox_mrt                -7.2589e+01 -8.2109e-01 -2.1153e-01  4.8280e-01
   prox_cbd                -1.8755e+01 -5.2624e-01  5.7581e-03  3.8363e-01
                                 Max.
   Intercept               260730.122
   no_of_kindergarten_500m   7363.101
   no_of_hawker_500m         9147.636
   prox_hawker                 26.525
   no_of_busstop_500m         498.818
   prox_busstop                22.877
   prox_shoppingmall           76.087
   prox_mrt                    30.284
   prox_cbd                    31.182
   ************************Diagnostic information*************************
   Number of data points: 25713 
   Effective number of parameters (2trace(S) - trace(S'S)): 1781.954 
   Effective degrees of freedom (n-2trace(S) + trace(S'S)): 23931.05 
   AICc (GWR book, Fotheringham, et al. 2002, p. 61, eq 2.33): 399410 
   AIC (GWR book, Fotheringham, et al. 2002,GWR p. 96, eq. 4.22): 397839.9 
   BIC (GWR book, Fotheringham, et al. 2002,GWR p. 61, eq. 2.34): 384991.8 
   Residual sum of squares: 7472900444 
   R-square value:  0.3305169 
   Adjusted R-square value:  0.2806638 

   ***********************************************************************
   Program stops at: 2024-10-26 11:42:30.587395 

Notice that the R-square value of GWR model is 0.2806638. Which is significantly better than the Multiple Linear Regression model (0.03708). However, it is still quite low, unlike what we have in our Hands-on exercise

Based on the documentation, the following are the options of each arguments that we can use to allow our user to select in our Shiny App:

  • approach: only for bw.gwr() method

    • CV => cross-validation approach

    • AIC => Akaike Information Criterion corrected

  • kernel:

    • gaussian => Applies a smooth, continuous weighting that decreases exponentially as the distance increases.

    • exponential => Similar to Gaussian, but with a sharper drop-off in weights as distance increases.

    • bisquare => Provides weights that sharply drop to zero after a certain distance (bandwidth).

    • tricube => Similar to the bisquare but with a gentler tapering off of weights.

    • boxcar => Provides uniform weights to all observations within the bandwidth and zero weight outside.

  • adptive:

    • TRUE => calculate an adaptive kernel where the bandwidth (bw) corresponds to the number of nearest neighbours.

    • FALSE => indicate bandwidth is a fixed distance

  • longlat:

    • TRUE => great circle distance will be calculated

    • FALSE => Euclidean distance

5.3 Visualising GWR Output

5.3.1 Converting SDF into SF data.frame

In order for us to visualise the fields in SDF, we need to first covert it into sf data.frame by using the code chunk below.

Click to expand/collapse code
rental_sf_adaptive <- st_as_sf(gwr.adaptive$SDF) %>%
  st_transform(crs=3414)

gwr_adaptive_output <- as.data.frame(gwr.adaptive$SDF)

rental_sf_adaptive <- cbind(rental.res.sf, gwr_adaptive_output)

We can use glimpse() to display the content of rental_sf_adaptive sf data frame.

glimpse(rental_sf_adaptive)
Rows: 25,713
Columns: 54
$ rent_approval_date         <date> 2024-01-01, 2024-01-01, 2024-01-01, 2024-0…
$ town                       <chr> "YISHUN", "JURONG WEST", "SENGKANG", "KALLA…
$ flat_type                  <chr> "4-ROOM", "4-ROOM", "5-ROOM", "3-ROOM", "5-…
$ monthly_rent               <dbl> 3700, 3900, 3700, 3600, 4350, 3000, 3800, 3…
$ region                     <chr> "NORTH REGION", "WEST REGION", "NORTH-EAST …
$ no_of_kindergarten_500m    <dbl> 1, 1, 0, 1, 1, 6, 3, 1, 1, 1, 1, 0, 2, 4, 0…
$ prox_kindergarten          <dbl> 3.717727e+02, 1.241314e+02, 6.531547e+02, 4…
$ no_of_childcare_500m       <dbl> 10, 9, 4, 3, 6, 11, 9, 9, 7, 6, 4, 1, 8, 10…
$ prox_childcare             <dbl> 6.318731e+01, 7.642944e+01, 8.264710e+01, 4…
$ no_of_hawker_500m          <dbl> 1, 0, 0, 1, 2, 0, 0, 0, 2, 0, 0, 0, 1, 1, 0…
$ prox_hawker                <dbl> 478.4537, 840.4254, 660.6058, 332.1417, 354…
$ no_of_busstop_500m         <dbl> 17, 17, 6, 11, 13, 17, 15, 8, 16, 11, 5, 6,…
$ prox_busstop               <dbl> 174.00119, 80.37739, 70.48567, 161.93848, 2…
$ no_of_shoppingmall_1km     <dbl> 1, 1, 1, 2, 0, 3, 1, 2, 1, 2, 1, 0, 3, 4, 0…
$ prox_shoppingmall          <dbl> 704.70468, 905.82230, 735.18898, 565.82028,…
$ prox_mrt                   <dbl> 1259.97262, 1869.01210, 197.18773, 175.9875…
$ prox_prisch                <dbl> 271.15943, 1353.83517, 86.23193, 234.73109,…
$ prox_cbd                   <dbl> 15605.314, 14916.316, 12419.101, 2871.311, …
$ MLR_RES                    <dbl> 835.98147, 1065.55901, 671.21045, 286.82688…
$ Intercept                  <dbl> 10566.3179, 2974.1457, 2876.3821, 3078.9213…
$ no_of_kindergarten_500m.1  <dbl> -169.964431, 30.326000, 37.658655, 33.57878…
$ no_of_hawker_500m.1        <dbl> -222.291323, -475.642475, 66.682670, -39.00…
$ prox_hawker.1              <dbl> 0.289383836, -0.225557695, -0.073026306, 0.…
$ no_of_busstop_500m.1       <dbl> 4.0858496, -1.2413400, 6.0722980, -17.78274…
$ prox_busstop.1             <dbl> -1.169334775, -1.068334128, 0.195491192, -1…
$ prox_shoppingmall.1        <dbl> 0.39208613, -0.23662726, 0.11922886, 0.4814…
$ prox_mrt.1                 <dbl> -0.50908031, -0.02755118, 0.11343355, -0.93…
$ prox_cbd.1                 <dbl> -4.656794e-01, 5.028924e-02, -9.148488e-04,…
$ y                          <dbl> 3700, 3900, 3700, 3600, 4350, 3000, 3800, 3…
$ yhat                       <dbl> 2546.318, 3192.229, 2977.015, 3385.646, 303…
$ residual                   <dbl> 1153.6821, 707.7709, 722.9846, 214.3543, 13…
$ CV_Score                   <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
$ Stud_residual              <dbl> 2.1416795, 1.3108566, 1.3284811, 0.3960232,…
$ Intercept_SE               <dbl> 2609.1496, 589.9064, 1496.2782, 845.4407, 3…
$ no_of_kindergarten_500m_SE <dbl> 92.12708, 58.08166, 38.91825, 110.64519, 10…
$ no_of_hawker_500m_SE       <dbl> 160.56359, 147.34741, 151.92818, 177.07317,…
$ prox_hawker_SE             <dbl> 0.40234634, 0.37455192, 0.33002928, 0.48420…
$ no_of_busstop_500m_SE      <dbl> 28.352175, 17.673734, 7.321265, 30.781031, …
$ prox_busstop_SE            <dbl> 0.8495247, 1.0154354, 0.6445039, 1.6671948,…
$ prox_shoppingmall_SE       <dbl> 0.3442941, 0.1203734, 0.2128390, 0.5452087,…
$ prox_mrt_SE                <dbl> 0.22397036, 0.09584573, 0.32357229, 0.56633…
$ prox_cbd_SE                <dbl> 0.16803460, 0.02776851, 0.12414654, 0.40534…
$ Intercept_TV               <dbl> 4.04971719, 5.04172516, 1.92235782, 3.64179…
$ no_of_kindergarten_500m_TV <dbl> -1.84489102, 0.52212700, 0.96763483, 0.3034…
$ no_of_hawker_500m_TV       <dbl> -1.38444418, -3.22803415, 0.43890916, -0.22…
$ prox_hawker_TV             <dbl> 0.71924063, -0.60220676, -0.22127220, 0.600…
$ no_of_busstop_500m_TV      <dbl> 0.14411062, -0.07023643, 0.82940563, -0.577…
$ prox_busstop_TV            <dbl> -1.37645760, -1.05209459, 0.30332043, -1.15…
$ prox_shoppingmall_TV       <dbl> 1.13881178, -1.96577679, 0.56018340, 0.8829…
$ prox_mrt_TV                <dbl> -2.27298068, -0.28745335, 0.35056633, -1.64…
$ prox_cbd_TV                <dbl> -2.771330134, 1.811016892, -0.007369104, 0.…
$ Local_R2                   <dbl> 0.29623626, 0.25789347, 0.06706073, 0.32697…
$ geometry                   <POINT [m]> POINT (29463.95 45634.94), POINT (157…
$ geometry.1                 <POINT [m]> POINT (29463.95 45634.94), POINT (157…

5.3.2 Visualising Local R2

Local R2 values range between 0.0 and 1.0 and indicate how well the local regression model fits observed y values. Very low values indicate the local model is performing poorly. Mapping the Local R2 values to see where GWR predicts well and where it predicts poorly may provide clues about important variables that may be missing from the regression model.

tmap_mode("view")
tm_shape(mpsz_svy21)+
  tmap_options(check.and.fix = TRUE) +
  tm_polygons(alpha = 0.1) +
tm_shape(rental_sf_adaptive) +  
  tm_dots(col = "Local_R2",
          border.col = "gray60",
          border.lwd = 1) +
  tm_view(set.zoom.limits = c(11,14))

6. UI Design

The below are the UI design for our Shiny App, allowing users to interactively explore the relationships among the independent variables.

6.1 Correlation Matrix

Figure 1: Proposed Correlation Matrix Plot

The figure above shows some of the interactive features that allow users to interact with options such as Method, Type, and Order. Note that the hclust.method field will only appear if the user selects hclust in the Order field. After selecting all the fields, the user needs to click the submit button, and the correlation matrix plot will update accordingly. The main objective of the page is to allow users to identify relationships between different independent variables.

6.2 GWR

Figure 2: Proposed GWR Local R2 Plot

The GWR plot page will display the distribution of the local R² scores, giving users an idea of the accuracy of the GWR model. A higher local R² indicates better explainability of the model, while a lower value suggests poorer explainability.

6.3 Other Consideration

While my primary focus will be on the main page of our Shiny App, I may also include a variable selection field to allow users more customization if the rendering time is manageable.

I may add additional pages to display more details about the GWR model, such as a summary page, and potentially include a separate page for the Multiple Linear Regression model, where the four types of diagnostic tests are displayed.

Moving forward, I will begin exploring how to shift the above work from Quarto to Shiny and start testing rendering performance. If real-time rendering takes too long, I will pre-compute values for different options so that users won’t experience long wait times when using our Shiny App.